This week we are formally introducing the ggplot2 package, which is essentially the graphics creation workhorse of the tidyverse. The lecture portion of the session steps through the logic of ggplot2 and explains that it is designed to standardize and streamline data visualization.



Let’s practice a bit using a rather timely dataset showing selected information related to the Super Bowl 🏈 Take a moment to import the file called superbowl_dataset.csv and familiarize yourself with its contents. Hint: You will need to use the read_csv function to import superbowl_dataset.csv from the data folder (i.e., “./data/superbowl_dataset.csv”).
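The import step in the hint might look something like the sketch below. It assumes the readr and dplyr packages are installed and that the file sits at the path given in the hint; the file.exists guard is an addition for illustration, so the chunk does nothing (rather than erroring) when run outside the course project folder.

```r
library(readr)
library(dplyr)

# Path taken from the hint above; adjust it if your project is laid out differently.
csv_path <- "./data/superbowl_dataset.csv"

# Guard added for illustration: only read the file if it is actually there.
if (file.exists(csv_path)) {
  superbowls <- read_csv(csv_path)
  glimpse(superbowls)  # column names, types, and the first few values
}
```

The glimpse function is one quick way to get familiar with the contents; View(superbowls) or summary(superbowls) would work as well.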


The foundation for data visualization in R is the ggplot2 package that you read about today.

Basically, we use the ggplot function to create a plot object, then add one or more of the geom_ functions to give geometry to what would otherwise be an empty plot object. We will also see later that we can use familiar dplyr functions to manipulate a data object, then pipe the result (%>%) directly into the ggplot function.
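Here is a minimal sketch of that layering idea. The toy data frame and its column names (team, points) are made up for illustration and have nothing to do with the Super Bowl file.

```r
library(ggplot2)
library(dplyr)

# Hypothetical toy data, invented for this example
toy <- tibble(team   = c("A", "B", "A", "C", "A", "B"),
              points = c(21, 35, 17, 28, 42, 24))

# 1. A plot object with no geometry yet: nothing is drawn from the data
empty_plot <- ggplot(data = toy)

# 2. Adding a geom_ layer gives the plot something to draw (one bar per team)
bar_plot <- empty_plot + geom_bar(aes(x = team))

# 3. dplyr verbs piped straight into ggplot: only games with more than 20 points
piped_plot <- toy %>%
  filter(points > 20) %>%
  ggplot() +
  geom_bar(aes(x = team))
```

Printing empty_plot, bar_plot, and piped_plot one after another shows how each step changes what ends up on the screen.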

Important dplyr Verbs To Remember

dplyr verb     Description
select()       select columns
filter()       filter rows
arrange()      re-order or arrange rows
mutate()       create new columns
summarise()    summarise values
group_by()     allows for group operations
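As a quick refresher, the sketch below chains all six verbs together. The games table and its columns are hypothetical values invented for the example, not rows from the Super Bowl dataset.

```r
library(dplyr)

# Hypothetical toy data, invented for this example
games <- tibble(
  winner     = c("A", "B", "A", "C"),
  winner_pts = c(21, 35, 17, 28),
  loser_pts  = c(14, 10, 16, 10)
)

summary_tbl <- games %>%
  select(winner, winner_pts, loser_pts) %>%    # keep only these columns
  filter(winner_pts > 20) %>%                  # keep rows where the winner scored more than 20
  mutate(margin = winner_pts - loser_pts) %>%  # create a new column
  group_by(winner) %>%                         # set up a group operation
  summarise(mean_margin = mean(margin)) %>%    # one summary row per group
  arrange(desc(mean_margin))                   # largest margin first

summary_tbl
```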


The code you write specifies the connections between the variables in your data, and the colors, points, and shapes you see on the screen.

Kieran Healy

This quote refers to the aesthetic mappings that you read about today, which is how we build a visualization from raw data with ggplot2. Sometimes it is easier to dive right in… 👙 let’s create a barchart that shows which NFL teams have won the Super Bowl the most times.


If you open the help documentation for the theme function (?theme), you get a rather long list of arguments that correspond to elements of the graphic we are creating. In the code below, we set the axis.title.y and axis.title.x arguments of the theme function to element_blank() in order to remove the default axis titles from the plot. Similarly, we use the angle argument of the element_text function to rotate the team names by 90°.


# Count the Super Bowl wins for each team (geom_bar tallies the rows for us)
ggplot(data = superbowls) +
  geom_bar(aes(x = Winner)) +
  theme(axis.text.x = element_text(size = 8, angle = 90),
        axis.title.x = element_blank(),
        axis.title.y = element_blank())

# Same chart, with the teams ordered from most to fewest wins
ggplot(data = superbowls) +
  geom_bar(aes(x = forcats::fct_infreq(Winner))) +
  theme(axis.text.x = element_text(size = 8, angle = 90),
        axis.title.x = element_blank()) +
  labs(y = "Wins")

# Teams ordered by number of losses, with one fill color for every bar
ggplot(data = superbowls) +
  geom_bar(aes(x = forcats::fct_infreq(Loser)), fill = "darkseagreen") +
  theme(axis.text.x = element_text(size = 8, angle = 90),
        axis.title.x = element_blank()) +
  labs(y = "Losses")


In the code chunk above, we use functions from the forcats package discussed last time to manipulate how the bars are displayed in the chart. Take a look at ?fct_infreq if you would.
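To see what fct_infreq (and its companion fct_rev) does without drawing anything, we can look at the factor levels directly. The team labels below are made-up placeholders.

```r
library(forcats)

# Hypothetical winners: "C" appears three times, "A" twice, "B" once
winners <- factor(c("B", "C", "C", "A", "C", "A"))

default_order  <- levels(winners)                       # alphabetical: "A" "B" "C"
by_frequency   <- levels(fct_infreq(winners))           # most frequent first: "C" "A" "B"
reversed_order <- levels(fct_rev(fct_infreq(winners)))  # least frequent first: "B" "A" "C"
```

Because ggplot2 draws one bar per factor level, in level order, reordering the levels is what reorders the bars.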


If we want to change the color of all the bars in the chart, we can set the fill argument outside of the aes function. Recall that the aes function outlines which attributes (columns) in the dataset are represented by different characteristics of the barchart.


If we wanted to have the color of the bars reflect some other characteristic of interest (that does not apply in this case, mind you), we would set the fill argument inside of the aes function.
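The difference between the two placements is easier to see side by side. This is only a sketch on made-up data; the conference column is a hypothetical attribute invented for the example.

```r
library(ggplot2)

# Hypothetical toy data, invented for this example
toy <- data.frame(
  team       = c("A", "A", "B", "C", "C", "C"),
  conference = c("East", "East", "West", "East", "East", "East")
)

# fill OUTSIDE aes(): a constant, so every bar gets the same color and no legend
p_constant <- ggplot(toy) +
  geom_bar(aes(x = team), fill = "darkseagreen")

# fill INSIDE aes(): mapped to a column, so color varies with the data
# and ggplot2 adds a legend for us
p_mapped <- ggplot(toy) +
  geom_bar(aes(x = team, fill = conference))
```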


# Horizontal bars: wins since 2001, with the most successful team at the top
superbowls %>%
  mutate(year = lubridate::year(actual_date)) %>%
  filter(year > 2000) %>%
  ggplot() +
  geom_bar(aes(y = forcats::fct_rev(forcats::fct_infreq(Winner))),
           color = "royalblue", fill = "dodgerblue") +
  # The teams are now on the y axis, so the counts ("Wins") belong on x
  theme(axis.text.y = element_text(size = 8),
        axis.title.y = element_blank()) +
  labs(title = "Super Bowl Victories",
       subtitle = "2001 to 2022", x = "Wins")

# Same chart with different colors and a centered title and subtitle
superbowls %>%
  mutate(year = lubridate::year(actual_date)) %>%
  filter(year > 2000) %>%
  ggplot() +
  geom_bar(aes(y = forcats::fct_rev(forcats::fct_infreq(Winner))),
           color = "firebrick", fill = "pink") +
  theme(axis.text.y = element_text(size = 8),
        axis.title.y = element_blank(),
        plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5)) +
  labs(title = "Super Bowl Victories",
       subtitle = "2001 to 2022", x = "Wins")


This is a fun introduction, but we have already made a few barcharts. We can also create other types of plots including boxplots, scatterplots, and line charts. The R Graph Gallery mentioned in class is a great resource because it provides example code for a wide variety of plots.


Now let’s create a line chart that shows point total for the Super Bowl over time… 😄 Line charts make use of the geom_line function and we also get to practice some of the tricks from our foray into the lubridate package from last time!


superbowls %>%
  mutate(total_score = `Winner Pts` + `Loser Pts`) %>%
  ggplot(aes(x = actual_date, y = total_score)) +
  geom_line(color = "#a3c4dc", linetype = 1, size = 1.25) +
  geom_point(color = "royalblue", shape = 20, size = 5) +
  scale_x_date(date_breaks = "3 year", date_labels = "%Y",
               limits = c(as_date(min(superbowls$actual_date)), as_date(max(superbowls$actual_date)))) +
  labs(title = "Combined Super Bowl Score",
       subtitle = "1967 to 2022", y = "Points Scored", x = "") +
  # Apply the complete theme first: calling theme_minimal() last would
  # overwrite the axis text settings made in theme()
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

# Add a smoothing element for fun and profit...
superbowls %>%
  mutate(total_score = `Winner Pts` + `Loser Pts`) %>%
  ggplot(aes(x = actual_date, y = total_score)) +
  geom_line(color = "#a3c4dc", linetype = 1, size = 1.25) +
  geom_smooth(method = "auto", se = TRUE, fullrange = FALSE, level = 0.95, color = "red") +
  geom_point(color = "royalblue", shape = 20, size = 5) +
  scale_x_date(date_breaks = "3 year", date_labels = "%Y",
               limits = c(as_date(min(superbowls$actual_date)), as_date(max(superbowls$actual_date)))) +
  labs(title = "Combined Super Bowl Score",
       subtitle = "1967 to 2022", y = "Points Scored", x = "") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))


Take a minute to make sure you understand the different components of the code in the chunk above. Take a look at ?geom_smooth if you would be so kind 🙇
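If the geom_smooth arguments feel opaque, a small experiment can help. The sketch below uses simulated data (not the Super Bowl file) to compare the automatic smoother with a straight linear fit.

```r
library(ggplot2)

# Simulated toy data: a linear trend plus random noise
set.seed(1)
toy <- data.frame(x = 1:30)
toy$y <- 2 * toy$x + rnorm(30, sd = 5)

# method = "auto" lets ggplot2 choose the smoother (loess for small datasets);
# se = TRUE draws the confidence band and level sets its width
p_auto <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "auto", se = TRUE, level = 0.95)

# method = "lm" fits a straight line instead; se = FALSE hides the band
p_lm <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "red")
```

Printing p_auto and p_lm side by side makes it clear how much the choice of method shapes the story the red line tells.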

Your Turn

Can you modify the code above to display the margin of victory over time instead?


“The greatest value of a picture is when it forces us to notice what we never expected to see.”

John W. Tukey

Color is one of the most important tools we have for visualizing data and the viridis package is among the most popular options for working with color in R. There are several palettes available (see image below).
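One way to get a feel for the palettes is to ask the viridis function for a handful of colors from each. This is just a sketch, assuming the viridis package is installed; the choice of five colors is arbitrary.

```r
library(viridis)

# Each call returns a character vector of hex color codes
default_pal <- viridis(5)                      # the default "viridis" palette
plasma_pal  <- viridis(5, option = "plasma")
cividis_pal <- viridis(5, option = "cividis")

default_pal
plasma_pal
cividis_pal
```

The same option names can be passed to scale_fill_viridis and scale_color_viridis when building a plot.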

The official introductory vignette is available here if you want to learn more. Let’s create a few scatterplots to demonstrate how color can be useful for revealing patterns in data 😮 Scatterplots are especially useful for visualizing the correlation between two (or more) continuous variables of interest. Now let’s explore how viewership varies with combined score and/or year in this Super Bowl dataset to see how this works 🔍


superbowls <- superbowls %>%
  mutate(total_score = (`Winner Pts` + `Loser Pts`),
         total_eyes = WinnerMktSize + LoserMktSize,
         year = lubridate::year(actual_date))


# Supplying the data and the shared x/y mappings once in ggplot() lets both
# layers inherit them, so geom_text only needs to say which column is the label
ggplot(data = superbowls, aes(x = total_score, y = average_viewers)) +
  geom_point(color = "firebrick", size = 4) +
  theme(text = element_text(size = 12),
        legend.position = "none") +
  labs(x = "Points Scored", y = "Average Viewers") +
  geom_text(aes(label = year), size = 3.4,
            nudge_x = 0.25, nudge_y = 0.25,
            check_overlap = TRUE)

ggplot(data = superbowls, aes(x = total_score, y = average_viewers)) +
  geom_point(aes(color = network), size = 5) +
  theme(text = element_text(size = 12),
        legend.position = "bottom") +
  labs(x = "Points Scored", y = "Average Viewers") +
  geom_text(aes(label = super_bowl), size = 4.4,
            nudge_x = 0.25, nudge_y = 0.25,
            check_overlap = TRUE)

RColorBrewer::display.brewer.all()

ggplot() +
  geom_point(data = superbowls, aes(x = total_score, y = average_viewers, color = network), size = 3) +
  scale_color_brewer(palette = "Set1", name = "Broadcaster") + 
  theme(text = element_text(size = 12),
                legend.position = "bottom") +
  labs(x = "Points Scored", y = "Average Viewers")

ggplot() +
  geom_point(data = superbowls, aes(x = total_score, y = average_viewers, fill = year), size = 3, shape = 25) +
#  scale_fill_viridis(option = "cividis") + 
  scale_fill_viridis(option = "plasma") + 
  scale_y_continuous(labels = comma) +
  labs(x = "Points Scored", y = "Average Viewers", fill = "Year") + 
  # theme_minimal() goes first; placed last, it would reset the text size
  # and legend position set in theme()
  theme_minimal() +
  theme(text = element_text(size = 12),
        legend.position = "bottom")


Take a minute to make sure you understand the different components of the code in the chunk above.

Would you say that viewership is related to combined score (i.e., a high-scoring game)? Why or why not?

Another hypothesis might be that the size of the participating teams’ television markets determines viewership more than the combined score of the game.


Your Turn

Can you modify the code above to test whether total_eyes appears to be more strongly related to average_viewers than total_score was in the above scatterplots?


Yet another theory (one that might also explain the importance of year in the preceding graphics) holds that the presence of Tom Brady matters for both viewership and combined score 😏 Shall we investigate?


ggplot() +
  geom_point(data = superbowls, aes(x = total_eyes, y = average_viewers, fill = tom_brady), size = 3, shape = 22) +
#  scale_fill_viridis(option = "cividis") + 
  scale_fill_viridis(option = "plasma", discrete = TRUE) + 
  scale_y_continuous(labels = comma) +
  scale_x_continuous(labels = label_number(suffix = " M", scale = 1e-6)) +
  labs(x = "Total TV Households", y = "Average Viewers", fill = "Tom Brady Playing?") + 
  # theme_minimal() goes first; placed last, it would reset the settings in theme()
  theme_minimal() +
  theme(text = element_text(size = 12),
        legend.position = "bottom")

# Limit the viz to only years that Tom Brady was active...
superbowls %>%
  filter(year > 2000) %>%
  ggplot() +
  geom_point(aes(x = total_eyes, y = average_viewers, fill = tom_brady), size = 3, shape = 22) +
  scale_fill_viridis(option = "cividis", discrete = TRUE) + 
#  scale_fill_viridis(option = "plasma", discrete = TRUE) + 
  scale_y_continuous(labels = comma) +
  scale_x_continuous(labels = label_number(suffix = " M", scale = 1e-6)) +
  labs(x = "Total TV Households", y = "Average Viewers", fill = "Tom Brady Playing?") + 
  theme_minimal() +
  theme(text = element_text(size = 12),
        legend.position = "bottom")


I told my daughter that I saw a deer on the way to work this morning.

She said, “How do you know it was on its way to work?” 😆

So what do you think about the so-called “Brady effect”?


A boxplot helps us visualize variability in the data beyond measures like the mean or the median. The box in the middle of the plot spans the interquartile range (IQR), which runs from the 25th percentile to the 75th percentile and therefore contains the middle 50% of the data. The line near the middle of the box represents the median (the middle value of the data set). In geom_boxplot, each whisker extends from the edge of the box to the most extreme observation that lies within 1.5 × IQR of that edge, and the dots beyond the whiskers are potential outliers.
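The pieces of a boxplot can also be worked out by hand, which is a nice sanity check. The scores below are hypothetical numbers chosen so the arithmetic is easy to follow, and the 1.5 × IQR rule is the default used by geom_boxplot.

```r
# Hypothetical combined scores, invented for this example
scores <- c(22, 31, 37, 41, 46, 50, 55, 61, 98)

q <- quantile(scores, c(0.25, 0.5, 0.75))
box_low  <- q[["25%"]]   # lower edge of the box: 37
med      <- q[["50%"]]   # line inside the box: 46
box_high <- q[["75%"]]   # upper edge of the box: 55
iqr      <- box_high - box_low   # 18

# Whiskers reach the most extreme values within 1.5 * IQR of the box
lower_fence  <- box_low  - 1.5 * iqr   # 10
upper_fence  <- box_high + 1.5 * iqr   # 82
whisker_low  <- min(scores[scores >= lower_fence])   # 22
whisker_high <- max(scores[scores <= upper_fence])   # 61

# Anything beyond the fences is drawn as a dot
outliers <- scores[scores < lower_fence | scores > upper_fence]   # 98
```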


The code below filters by year and generates a boxplot showing combined score for Super Bowls in which Tom Brady played and those in which he did not.

As you learned in the reading, faceting generates multiple paneled plots based on a third variable. In the code chunk below, the ~ character is used to indicate how the existing boxplot object should be expanded, using the network attribute (column) from the source data. The ~ character is generally used to specify a formula in R, and you saw examples in the reading related to linear regression smoothers (e.g., geom_smooth with the method argument set to “lm”).


brady_boxplot <- superbowls %>%
  filter(year > 2000) %>%
  ggplot() +
  geom_boxplot(aes(x = as_factor(tom_brady), y = total_score, fill = network)) +
  scale_fill_manual(values = c("orange", "darkseagreen", "violet", "dodgerblue")) + 
  labs(y = "Points Scored", x = "Did Brady Play?", fill = "Broadcaster", 
       caption = "Note: Brady only played on ABC and NBC once.") + 
  # theme_minimal() goes first; placed last, it would reset the text and
  # legend settings made in theme()
  theme_minimal() +
  theme(text = element_text(size = 25),
        axis.title.x = element_text(face = "bold", color = "royalblue", size = 24),
        axis.title.y = element_text(face = "bold", color = "red", size = 12),
        legend.position = "bottom")

# One panel per network, each with its own axis ranges
brady_boxplot + facet_wrap(~ network, ncol = 4, scales = "free")


Your Turn

Can you modify the code above to test whether average_viewers is also higher in Super Bowls played since 2001 that featured Tom Brady by generating one or more boxplots?



You have reached the end!